!pip3 install datashader
!pip install pyproj
!pip install colorlover
!pip install plotly
!pip install ipywidgets
import datashader as ds
import datashader.transfer_functions as tf
import datashader.glyphs
from datashader import reductions
from datashader.core import bypixel
from datashader.utils import lnglat_to_meters as webm, export_image
from datashader.colors import colormap_select, Greys9, viridis, inferno
import copy
from pyproj import Proj, transform
import numpy as np
import pandas as pd
import urllib
import json
import datetime
import colorlover as cl
import plotly.offline as py
import plotly.graph_objs as go
from plotly import tools
from functools import partial
from IPython.display import GeoJSON
py.init_notebook_mode()
For module 2 we'll be looking at techniques for dealing with big data, in particular binning strategies and the datashader library (which possibly proves we'll never need to bin large data for visualization ever again).
To demonstrate these concepts we'll be looking at the PLUTO dataset put out by New York City's department of city planning. PLUTO contains data about every tax lot in New York City.
PLUTO data can be downloaded from here. Unzip the files into the same directory as this notebook, and you should be able to read them in using this (or very similar) code. Also take note of the data dictionary; it'll come in handy for this assignment.
# Code to read in v17, column names have been updated (without upper case letters) for v18
# bk = pd.read_csv('PLUTO17v1.1/BK2017V11.csv')
# bx = pd.read_csv('PLUTO17v1.1/BX2017V11.csv')
# mn = pd.read_csv('PLUTO17v1.1/MN2017V11.csv')
# qn = pd.read_csv('PLUTO17v1.1/QN2017V11.csv')
# si = pd.read_csv('PLUTO17v1.1/SI2017V11.csv')
# ny = pd.concat([bk, bx, mn, qn, si], ignore_index=True)
ny = pd.read_csv('C:/Users/mpena/Desktop/Mario/DATA608/DATA608_HW2/pluto_20v8.csv')
# Getting rid of some outliers
ny = ny[(ny['yearbuilt'] > 1850) & (ny['yearbuilt'] < 2020) & (ny['numfloors'] != 0)]
C:\Users\mpena\Anaconda3\envs\my_flask_env\lib\site-packages\IPython\core\interactiveshell.py:3155: DtypeWarning: Columns (18,19,20,21,22,23,24,26,63,76,79,86) have mixed types.Specify dtype option on import or set low_memory=False.
I'll also do some prep for the geographic component of this data, which we'll be relying on for datashader.
You're not required to know how I'm retrieving the latitude and longitude here, but for those interested: this dataset uses a flat x-y projection (assuming that, for a small enough area, the world is flat, which makes calculations easier), and this needs to be projected back to traditional latitude and longitude.
# wgs84 = Proj("+proj=longlat +ellps=GRS80 +datum=NAD83 +no_defs")
# nyli = Proj("+proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +ellps=GRS80 +datum=NAD83 +to_meter=0.3048006096012192 +no_defs")
# ny['xcoord'] = 0.3048*ny['xcoord']
# ny['ycoord'] = 0.3048*ny['ycoord']
# ny['lon'], ny['lat'] = transform(nyli, wgs84, ny['xcoord'].values, ny['ycoord'].values)
# ny = ny[(ny['lon'] < -60) & (ny['lon'] > -100) & (ny['lat'] < 60) & (ny['lat'] > 20)]
#Defining some helper functions for DataShader
background = "black"
export = partial(export_image, background = background, export_path="export")
cm = partial(colormap_select, reverse=(background!="black"))
Binning is a common strategy for visualizing large datasets. Binning is inherent to a few types of visualizations, such as histograms and 2D histograms (also check out their close relatives, 2D density plots, and the more general form, heatmaps).
While these visualization types explicitly include binning, any type of visualization used with aggregated data can be looked at in the same way. For example, let's say we wanted to look at building construction over time. This would be best viewed as a line graph, but we can still think of our results as being binned by year:
trace = go.Scatter(
    # I'm choosing BBL here because I know it's a unique key.
    x = ny.groupby('yearbuilt').count()['bbl'].index,
    y = ny.groupby('yearbuilt').count()['bbl']
)
layout = go.Layout(
    xaxis = dict(title = 'Year Built'),
    yaxis = dict(title = 'Number of Lots Built')
)
fig = go.FigureWidget(data = [trace], layout = layout)
fig.show()
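The "binned by year" view can be seen directly in the aggregation itself. The following is a minimal sketch on a toy frame (the six rows are made up for illustration; only the column names `bbl` and `yearbuilt` come from the notebook). It also computes the groupby once and reuses it for both axes:

```python
import pandas as pd

# Toy stand-in for the PLUTO frame; the values are invented for illustration.
lots = pd.DataFrame({
    "bbl": [1, 2, 3, 4, 5, 6],
    "yearbuilt": [1900, 1900, 1901, 1901, 1901, 1950],
})

# One groupby, reused for both axes of the line chart.
lots_per_year = lots.groupby("yearbuilt")["bbl"].count()

x_values = lots_per_year.index.tolist()  # the "bins" (one per year)
y_values = lots_per_year.tolist()        # number of lots counted in each bin
print(x_values, y_values)
```

Each distinct year acts as a bin, and the count is the height of that bin.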
Something looks off... You're going to have to deal with this imperfect data to answer this first question.
But first, some notes on pandas. Pandas dataframes are a different beast than R dataframes; here are some tips to help you get up to speed and through this homework:
Indexing and Selecting: .loc and .iloc are the analogs for base R subsetting, or filter() in dplyr
Group By: This is the pandas analog to group_by(), and the appended function is the analog to summarize(). Try out a few examples of this, and display the results in Jupyter. Take note of what's happening to the indexes: you'll notice that they become hierarchical. Once you perform an aggregation, try running the resulting hierarchical dataframe through a reset_index().
Reset_index: I personally find the hierarchical indexes more of a burden than a help, and this sort of hierarchical indexing leads to a fundamentally different experience compared to R dataframes. reset_index() is a way of restoring a dataframe to a flatter index style. Grouping is where you'll notice it the most, but it's also useful when you filter data, and in a few other split-apply-combine workflows. Indexes are more meaningful in pandas, so use this if you start getting unexpected results.
Indexes are more important in pandas than in R. If you delve deeper into using Python for data science, you'll begin to see the benefits in many places (despite the personal gripes I highlighted above). One place these indexes come in handy is with time series data. The pandas docs have a huge section on datetime indexing. In particular, check out resample, which provides time-series-specific aggregation.
Merging, joining, and concatenation: There's some overlap between these different types of merges, so use this as your guide. Concat is a single function that replaces cbind and rbind in R, and the results are driven by the indexes. Read through these examples to get a feel on how these are performed, but you will have to manage your indexes when you're using these functions. Merges are fairly similar to merges in R, similarly mapping to SQL joins.
Apply: This is explained in the "group by" section linked above. These are your analogs to the plyr library in R. Take note of the lambda syntax used here; lambdas are anonymous functions in Python. Rather than predefining a custom function, you can just define it inline using lambda.
Browse through the other sections for some other specifics, in particular reshaping and categorical data (pandas' answer to factors). Pandas can take a while to get used to, but it is a pretty strong framework that makes more advanced functions easier once you get used to it. Rolling functions, for example, follow logically from the apply workflow (and led to the best Google results ever when I first tried to find this out and googled "pandas rolling").
Google Wes McKinney's book "Python for Data Analysis," which is a cookbook-style intro to pandas. It's an O'Reilly book that should be pretty available out there.
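The tips above can be tried out on a small frame. This is only a sketch: the data and the column names `borough`, `landuse`, and `numfloors` are toy values chosen for illustration, not a claim about how the assignment should be solved.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({
    "borough": ["BK", "BK", "MN", "MN", "MN"],
    "landuse": [1, 2, 1, 1, 2],
    "numfloors": [2.0, 3.0, 10.0, 20.0, 5.0],
})

# .loc with a boolean mask: roughly the analog of filter() in dplyr.
tall = df.loc[df["numfloors"] > 4]

# Grouping by two keys produces a hierarchical (MultiIndex) result...
grouped = df.groupby(["borough", "landuse"])["numfloors"].mean()

# ...and reset_index() flattens it back into ordinary columns.
flat = grouped.reset_index()

# apply with a lambda: an anonymous function defined inline.
floor_span = df.groupby("borough")["numfloors"].apply(lambda s: s.max() - s.min())

print(grouped)
print(flat)
print(floor_span)
```

Displaying `grouped` next to `flat` in Jupyter makes the difference between the hierarchical and the flat index easy to see.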
After a few building collapses, the City of New York is going to begin investigating older buildings for safety. The city is particularly worried about buildings that were unusually tall when they were built, since best-practices for safety hadn’t yet been determined. Create a graph that shows how many buildings of a certain number of floors were built in each year (note: you may want to use a log scale for the number of buildings). Find a strategy to bin buildings (It should be clear 20-29-story buildings, 30-39-story buildings, and 40-49-story buildings were first built in large numbers, but does it make sense to continue in this way as you get taller?)
We will first start by subsetting the dataframe and leaving the variables we are interested in looking at, in this case "yearbuilt" and "numfloors":
year_floors = ny[['yearbuilt', 'numfloors']]
We will also add a column, "r10", that rounds each construction year down to the start of its decade:
# Work on an explicit copy so that adding a column does not trigger SettingWithCopyWarning
year_floors = year_floors.copy()
year_floors['r10'] = (year_floors['yearbuilt'] // 10 * 10).astype(int)
year_floors.head(10)
| | yearbuilt | numfloors | r10 |
|---|---|---|---|
| 0 | 1899.0 | 3.0 | 1890 |
| 1 | 1965.0 | 3.0 | 1960 |
| 2 | 1920.0 | 2.0 | 1920 |
| 4 | 1931.0 | 2.0 | 1930 |
| 5 | 1928.0 | 5.0 | 1920 |
| 21 | 2017.0 | 3.0 | 2010 |
| 22 | 1910.0 | 2.0 | 1910 |
| 28 | 1910.0 | 2.0 | 1910 |
| 31 | 1945.0 | 1.0 | 1940 |
| 36 | 1910.0 | 3.0 | 1910 |
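The rounding used for "r10" is integer division followed by multiplication. A minimal sketch, using a hypothetical helper `floor_to_bin` (not part of the notebook) so that the same idea can be tried with other bin widths:

```python
import pandas as pd

def floor_to_bin(values: pd.Series, width: int) -> pd.Series:
    """Round each value down to the nearest multiple of `width`."""
    # Floor division drops the remainder; multiplying back gives the bin start.
    return (values // width * width).astype(int)

# A few years taken from the table above.
years = pd.Series([1899.0, 1965.0, 1920.0, 2017.0])

decades = floor_to_bin(years, 10)
quarter_centuries = floor_to_bin(years, 25)

print(decades.tolist())
print(quarter_centuries.tolist())
```

Wider bins smooth out noisy year-to-year counts at the cost of temporal detail.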
We will create bins below to pass to the pandas.cut function, which will be used to transform our data. I have defined 11 bin edges, which yield 10 bins that group our buildings by number of floors within each "r10" decade. The code counts the number of buildings per bin for each "r10" value and fills in 0s for n/a values:
bins = (0, 1, 2, 3, 10, 20, 30, 40, 50, 60, 105)
year_floors_grouped = year_floors.groupby(['r10', pd.cut(year_floors['numfloors'], bins)]) \
.count() \
.drop(['numfloors'], axis=1) \
.fillna(0) \
.reset_index()
year_floors_grouped.head(20)
| | r10 | numfloors | yearbuilt |
|---|---|---|---|
| 0 | 1850 | (0, 1] | 4 |
| 1 | 1850 | (1, 2] | 66 |
| 2 | 1850 | (2, 3] | 845 |
| 3 | 1850 | (3, 10] | 638 |
| 4 | 1850 | (10, 20] | 1 |
| 5 | 1850 | (20, 30] | 0 |
| 6 | 1850 | (30, 40] | 0 |
| 7 | 1850 | (40, 50] | 0 |
| 8 | 1850 | (50, 60] | 0 |
| 9 | 1850 | (60, 105] | 0 |
| 10 | 1860 | (0, 1] | 9 |
| 11 | 1860 | (1, 2] | 92 |
| 12 | 1860 | (2, 3] | 830 |
| 13 | 1860 | (3, 10] | 612 |
| 14 | 1860 | (10, 20] | 3 |
| 15 | 1860 | (20, 30] | 0 |
| 16 | 1860 | (30, 40] | 0 |
| 17 | 1860 | (40, 50] | 0 |
| 18 | 1860 | (50, 60] | 0 |
| 19 | 1860 | (60, 105] | 0 |
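The interval labels in the table, such as (3, 10], come from pd.cut, which by default builds right-closed intervals from the edges. A small sketch with made-up floor counts shows which bin each value lands in and how empty bins are kept:

```python
import pandas as pd

# Made-up floor counts, chosen to hit the bin boundaries.
floors = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 104.0])
edges = (0, 1, 2, 3, 10, 20, 30, 40, 50, 60, 105)

# 11 edges define 10 right-closed intervals: (0, 1], (1, 2], ..., (60, 105].
binned = pd.cut(floors, edges)

# observed=False keeps bins with no members, so they show up as 0.
counts = floors.groupby(binned, observed=False).size()
print(counts)
```

Note that a value equal to an edge, such as 10.0, falls into the bin that ends at that edge, not the one that starts there.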
Next, we create a "groups" variable, which groups the transformed data by floor bin. In the code below, notice that we use "yearbuilt" as the y-axis in our graph function. This is because, after the count, the "yearbuilt" column holds the number of buildings in each floor bin for each "r10" decade, as shown above.
groups = year_floors_grouped.groupby('numfloors')
fig = go.Figure()
for g in groups.groups:
    group = groups.get_group(g)
    fig.add_trace(go.Bar(x=group['r10'], y=group['yearbuilt'], name=str(g)))
fig.update_layout(barmode='stack', xaxis={'categoryorder':'category ascending'})
fig.show()
If we place the cursor over the graph, we get some limited yet helpful interactivity: the tooltip shows the number of buildings in each floor bin (color) for each decade.
Datashader is a library from Anaconda that does away with the need for binning data. It takes in all of your data points and, based on the canvas and range, performs pixel-by-pixel calculations to come up with the best representation of the data. In short, this eliminates the need to bin your data manually.
As an example, let's continue with our question above and look at a 2D histogram of YearBuilt vs NumFloors:
yearbins = 200
floorbins = 200
yearBuiltCut = pd.cut(ny['yearbuilt'], np.linspace(ny['yearbuilt'].min(), ny['yearbuilt'].max(), yearbins))
# np.logspace takes base-10 exponents, so the upper bound uses np.log10; 10**0 starts the bins at 1 floor
numFloorsCut = pd.cut(ny['numfloors'], np.logspace(0, np.log10(ny['numfloors'].max()), floorbins))
xlabels = np.floor(np.linspace(ny['yearbuilt'].min(), ny['yearbuilt'].max(), yearbins))
ylabels = np.floor(np.logspace(0, np.log10(ny['numfloors'].max()), floorbins))
fig = go.FigureWidget(
    data = [
        go.Heatmap(z = ny.groupby([numFloorsCut, yearBuiltCut])['bbl'].count().unstack().fillna(0).values,
                   colorscale = 'Greens', x = xlabels, y = ylabels)
    ]
)
fig.show()
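np.logspace treats its first two arguments as base-10 exponents, which matters when choosing log-spaced floor bins like the ones in the heatmap above. A minimal sketch, assuming a maximum of 104 floors purely for illustration:

```python
import numpy as np

max_floors = 104.0  # assumed maximum, for illustration only
n_edges = 5

# np.logspace(start, stop, num) returns num values from 10**start to 10**stop.
edges = np.logspace(0, np.log10(max_floors), n_edges)

# Consecutive log-spaced edges share a constant ratio.
ratios = edges[1:] / edges[:-1]

# Passing a natural log as the stop exponent overshoots the data range.
overshoot = np.logspace(0, np.log(max_floors), n_edges)[-1]

print(edges)
print(ratios)
print(overshoot > max_floors)
```

With np.log10 the last edge equals the maximum; with np.log the last edge is 10 raised to the natural log of the maximum, which is far larger than any floor count in the data.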
This shows us the distribution, but it's subject to some biases discussed in the Anaconda notebook Plotting Perils.
Here is what the same plot would look like in datashader:
cvs = ds.Canvas(800, 500, x_range = (ny['yearbuilt'].min(), ny['yearbuilt'].max()),
y_range = (ny['numfloors'].min(), ny['numfloors'].max()))
agg = cvs.points(ny, 'yearbuilt', 'numfloors')
view = tf.shade(agg, cmap = cm(Greys9), how='log')
export(tf.spread(view, px=2), 'yearvsnumfloors')
That's technically just a scatterplot, but the points are smartly placed and colored to mimic what one gets in a heatmap. Based on the pixel size, it will either display individual points, or will color the points of denser regions.
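The idea of per-pixel aggregation followed by shading can be approximated with plain NumPy. This is a rough analogue under stated assumptions (random toy points, a small 80x50 canvas, a simple log normalization); it is not datashader's actual implementation, and its log shading may differ in detail from `how='log'`.

```python
import numpy as np

# Toy points standing in for (yearbuilt, numfloors); values are random.
rng = np.random.default_rng(0)
x = rng.uniform(1850, 2020, size=10_000)
y = rng.uniform(1, 100, size=10_000)

width, height = 80, 50  # canvas size in pixels (smaller than the 800x500 above)

# Step 1 (aggregate): count how many points fall into each pixel.
counts, _, _ = np.histogram2d(
    x, y, bins=(width, height), range=((1850, 2020), (1, 100))
)

# Step 2 (shade): map counts to [0, 1] on a log scale so sparse pixels stay visible.
shade = np.log1p(counts) / np.log1p(counts.max())

print(counts.shape, counts.sum(), shade.min(), shade.max())
```

Because the "bins" are the pixels of the canvas, changing the canvas size or the plotted range changes the aggregation automatically, which is why no manual bin choice is needed.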
Datashader really shines when looking at geographic information. Here are the latitudes and longitudes of our dataset plotted out, giving us a map of the city colored by density of structures:
NewYorkCity = (( 913164.0, 1067279.0), (120966.0, 272275.0))
cvs = ds.Canvas(700, 700, *NewYorkCity)
agg = cvs.points(ny, 'xcoord', 'ycoord')
view = tf.shade(agg, cmap = cm(inferno), how='log')
export(tf.spread(view, px=2), 'firery')
Interestingly, since we're looking at structures, the large buildings of Manhattan show up as less dense on the map. The densest areas, measured by number of lots, would be single- or multi-family townhomes.
Unfortunately, Datashader doesn't have the best documentation. Browse through the examples from their GitHub repo. I would focus on the visualization pipeline and the US Census example for the question below. Feel free to use my samples as templates as well when you work on this problem.
You work for a real estate developer and are researching underbuilt areas of the city. After looking in the Pluto data dictionary, you've discovered that all tax assessments consist of two parts: The assessment of the land and assessment of the structure. You reason that there should be a correlation between these two values: more valuable land will have more valuable structures on them (more valuable in this case refers not just to a mansion vs a bungalow, but an apartment tower vs a single family home). Deviations from the norm could represent underbuilt or overbuilt areas of the city. You also recently read a really cool blog post about bivariate choropleth maps, and think the technique could be used for this problem.
Datashader is really cool, but it's not that great at labeling your visualization. Don't worry about providing a legend, but provide a quick explanation as to which areas of the city are overbuilt, which areas are underbuilt, and which areas are built in a way that's properly correlated with their land value.
We first have to calculate the assessed value of the building on each lot, as it seems to be missing from the data. We can add this extra column by subtracting the assessment of the land from the total assessment:
ny['assesssbld'] = ny['assesstot'] - ny['assessland']
We can then also create a separate dataframe that only contains the columns we will need for our datashader analysis:
data_shader = pd.DataFrame(ny[['assessland', 'assesssbld', 'xcoord', 'ycoord']])
data_shader.describe()
| | assessland | assesssbld | xcoord | ycoord |
|---|---|---|---|---|
| count | 8.106760e+05 | 8.106760e+05 | 8.105380e+05 | 810538.000000 |
| mean | 1.101640e+05 | 4.308225e+05 | 1.006350e+06 | 191354.612554 |
| std | 4.076812e+06 | 7.315718e+06 | 3.255027e+04 | 30511.749150 |
| min | 0.000000e+00 | 0.000000e+00 | 9.131640e+05 | 120966.000000 |
| 25% | 1.038000e+04 | 2.550000e+04 | 9.896700e+05 | 168057.000000 |
| 50% | 1.416000e+04 | 3.810000e+04 | 1.009035e+06 | 189086.500000 |
| 75% | 2.184000e+04 | 7.290000e+04 | 1.029534e+06 | 210838.000000 |
| max | 3.211276e+09 | 4.029011e+09 | 1.067279e+06 | 272275.000000 |
Lastly, we will add categorical values to our dataframe that will tell us whether the assessment (value) of the land and building are either "low", "med" or "high".
I will arbitrarily choose what I consider to be "low" and "high" numbers for the value of land and buildings based on our statistics above:
data_shader['land_value'] = 'med'
data_shader.loc[data_shader['assessland'] < 10000, 'land_value'] = 'low'
data_shader.loc[data_shader['assessland'] >= 21000, 'land_value'] = 'high'
data_shader['bld_value'] = 'med'
data_shader.loc[data_shader['assesssbld'] < 25000, 'bld_value'] = 'low'
data_shader.loc[data_shader['assesssbld'] >= 72000, 'bld_value'] = 'high'
data_shader['totval'] = pd.Categorical(data_shader['land_value'] + '-' + data_shader['bld_value'])
data_shader.head()
| | assessland | assesssbld | xcoord | ycoord | land_value | bld_value | totval |
|---|---|---|---|---|---|---|---|
| 0 | 24900.0 | 146520.0 | 996992.0 | 234157.0 | high | high | high-high |
| 1 | 15960.0 | 56700.0 | 984030.0 | 159620.0 | med | med | med-med |
| 2 | 12900.0 | 61800.0 | 982473.0 | 158966.0 | med | med | med-med |
| 4 | 5460.0 | 54600.0 | 996091.0 | 164469.0 | low | med | low-med |
| 5 | 229950.0 | 836550.0 | 997599.0 | 220113.0 | high | high | high-high |
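The same three-level labeling can also be written with pd.cut. This is an alternative sketch, not the approach used above; it assumes the same land thresholds (10000 and 21000) and uses right=False so that a value equal to a threshold moves up to the higher category, matching the `<` and `>=` comparisons.

```python
import numpy as np
import pandas as pd

# A few land assessments, including values exactly on the thresholds.
assessland = pd.Series([5460.0, 10000.0, 12900.0, 21000.0, 229950.0])

land_value = pd.cut(
    assessland,
    bins=[-np.inf, 10000, 21000, np.inf],  # [-inf, 10000), [10000, 21000), [21000, inf)
    right=False,                           # left-closed, like `< 10000` and `>= 21000`
    labels=["low", "med", "high"],
)

print(land_value.astype(str).tolist())
```

An advantage of this form is that the result is an ordered categorical, so the levels sort as low, med, high instead of alphabetically.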
We will also assign a color to each level of our "totval" categorical column, based on the bivariate choropleth maps in the blog post suggested above by the professor. I have chosen to go with the far-right bivariate choropleth palette in the image below:
colors = {
'low-low': '#e8e8e8',
'low-med': '#cbb8d7',
'low-high': '#9972af',
'med-low': '#e4d9ac',
'med-med': '#c8ada0',
'med-high': '#976b82',
'high-low': '#c8b35a',
'high-med': '#af8e53',
'high-high': '#804d36'
}
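The nine keys follow a land-building ordering, so the palette is really a 3x3 grid. A small sketch that builds the same mapping from such a grid makes the layout explicit (the names `levels`, `grid`, and `color_key` are illustrative; the hex values are the ones listed above):

```python
levels = ["low", "med", "high"]

# Rows: land value (low -> high); columns: building value (low -> high).
grid = [
    ["#e8e8e8", "#cbb8d7", "#9972af"],
    ["#e4d9ac", "#c8ada0", "#976b82"],
    ["#c8b35a", "#af8e53", "#804d36"],
]

# Key format is "<land>-<building>", matching the "totval" column.
color_key = {
    f"{land}-{bld}": grid[i][j]
    for i, land in enumerate(levels)
    for j, bld in enumerate(levels)
}

print(color_key["low-high"], color_key["high-low"])
```

Reading the grid this way, the purple corner is low land with high building value, and the mustard corner is high land with low building value.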
Now we plot our data points using datashader:
NewYorkCity = ((913164.0, 1067279.0), (120966.0, 272275.0))
cvs = ds.Canvas(700, 700, *NewYorkCity)
agg = cvs.points(data_shader, 'xcoord', 'ycoord', ds.count_cat('totval'))
view = tf.shade(agg, color_key = colors)
export(tf.spread(view, px=1), 'firery')
From the map above, we can see that Manhattan has mostly high land and high building values, shown in dark brown (high-high), so the two values correlate there. Another area that seems to be built in a way that is properly correlated with its land value is Staten Island: it is mostly greyish and light mustard, which suggests low-low and med-low values of land and buildings. It is hard to spot any areas with dark purple or dark mustard colors, which would indicate low-high or high-low values. However, some areas in Queens and Brooklyn approximate the low-med and low-high colors, which would mean they are overbuilt. On the other hand, some parts of Queens and Brooklyn that are farther away from the city approximate a dark mustard color, which suggests a high-low value, meaning they are underbuilt.
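A visual reading like this can be cross-checked with a simple count of the category combinations. The sketch below uses a six-row toy frame (invented values, with the same `land_value` and `bld_value` column names) to show the idea; on the real data the same crosstab could be computed per area, given a suitable area column.

```python
import pandas as pd

# Toy labels, invented for illustration.
toy = pd.DataFrame({
    "land_value": ["low", "low", "med", "high", "high", "high"],
    "bld_value":  ["low", "high", "med", "low", "high", "high"],
})

# Rows: land value; columns: building value.
table = pd.crosstab(toy["land_value"], toy["bld_value"])

# The off-diagonal corners are the lots that deviate from the norm.
overbuilt = table.loc["low", "high"]   # cheap land, valuable building
underbuilt = table.loc["high", "low"]  # valuable land, cheap building

print(table)
print(overbuilt, underbuilt)
```

Lots on the diagonal (low-low, med-med, high-high) are the ones where building value tracks land value.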